Modeling Run-Time Distributions in Passively Replicated Fault-Tolerant Systems
Abstract
Many real-time applications have strict reliability requirements in addition to their timing requirements. To meet these reliability requirements, a fault-tolerance strategy may be necessary. An active replication strategy, where several instances of a task are run in parallel, is the preferred choice for many real-time systems, as the parallel execution gives a high probability that at least some of the instances finish successfully before their deadlines, even if others fail. However, running several instances of a single task in parallel increases the need for processing power, which is costly and raises space and energy requirements. In a passive replication strategy, only one instance of a task is run at a time. If the task fails, a backup is readied, and the task is rerun on the backup. This requires fewer resources than active replication, but the extra time needed for the rerun can increase the probability of deadline misses. Thus, analyzing the timing of these systems is necessary. Analysis using the worst-case execution times of the tasks in the fault-tolerant system can often give very conservative results, especially if the tasks' normal execution times rarely approach the worst-case times. Analyzing the run-time distributions of the tasks in passively replicated fault-tolerant systems can be a useful tool for deciding whether a passive replication strategy is suitable for the system. Unlike worst-case execution time analysis, the distributions can also show the improvement in reliability for systems where the passive replication strategy does not work in the worst-case scenario. This improvement may be large enough to justify the use of the replication strategy. In this work, mathematical models for run-time distributions of tasks in several classes of passive replication systems are developed.
These models give the run-time distributions as functions of parameter distributions of the modeled system, such as the fault-free run time of a task, the fault detection time distribution, and the distribution of the time between fault detection and the start of the rerun of the task. The different fault detection mechanisms used in passive replication systems lead to different structures of the mathematical models. Whether the replicas are homogeneous or inhomogeneous also affects the model structure. Many other differences in the modeled systems' structure will lead to differences in the parameter distributions, but not in the structure of the …
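To make the idea of a run-time distribution built from parameter distributions concrete, here is a minimal Python sketch. It is not the model developed in the work. It assumes a deliberately simple case: discrete time steps, at most one fault per task, a fault probability `p_fail`, independent parameter distributions, a detection time measured from task start, and a backup that always completes the rerun. All function names and numbers are hypothetical.

```python
def convolve(a, b):
    """PMF of the sum of two independent variables.

    Each PMF is a list where index t holds P(value == t time steps).
    """
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out


def run_time_pmf(fault_free, detection, switchover, p_fail):
    """Run-time PMF under the simplified single-fault assumption.

    With probability 1 - p_fail the task finishes after its fault-free
    run time. Otherwise the run time is detection time + switchover
    time + the fault-free run time of the rerun on the backup.
    """
    faulty = convolve(convolve(detection, switchover), fault_free)
    out = [0.0] * len(faulty)
    for t, p in enumerate(fault_free):
        out[t] += (1.0 - p_fail) * p
    for t, p in enumerate(faulty):
        out[t] += p_fail * p
    return out


def miss_probability(pmf, deadline):
    """Probability that the run time exceeds the deadline (in time steps)."""
    return sum(p for t, p in enumerate(pmf) if t > deadline)


# Hypothetical parameter distributions, chosen only for illustration.
fault_free = [0.0, 0.0, 0.5, 0.5]  # task takes 2 or 3 steps, equally likely
detection = [0.0, 1.0]             # fault is always detected after 1 step
switchover = [0.0, 1.0]            # backup is always ready after 1 step

pmf = run_time_pmf(fault_free, detection, switchover, p_fail=0.1)
print(miss_probability(pmf, deadline=4))
```

In this toy case the worst-case run time is 5 steps, so a worst-case analysis would reject a deadline of 4, while the distribution shows how small the probability of a miss actually is under the stated assumptions.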
Similar resources
Run-time Distributions in Passively Replicated Systems Using Timeout and Acceptance Fault Detection
Fault tolerance based on passive replication is common in many systems. If this kind of fault tolerance mechanism is to be used in a real-time system, timing analysis is necessary, as the fault tolerance mechanism itself may cause timing faults. There are different ways of detecting when the primary replica has failed; one is to use a timeout to detect crash and omission failures, anoth...
Implementation of the GARF replicated objects platform
This paper presents the design and implementation of the GARF system, an object-oriented platform that helps programming fault-tolerant distributed applications in a modular way. The originality of GARF is to separate a distributed object into several objects, the complexity of distribution and fault-tolerance being encapsulated in reusable classes. The use of those classes by the GARF system i...
A Replicated Monitoring Tool
Modeling the reliability of distributed systems requires a good understanding of the reliability of the components. Careful modeling allows highly fault-tolerant distributed applications to be constructed at the least cost. Realistic estimates can be found by measuring the performance of actual systems. An enormous amount of information about system performance can be acquired with no special p...
Fault-tolerant disk storage and file systems using reflective memory
Most replicated storage and file systems either take a specialized hardware approach or a software-oriented approach to fault tolerance. This paper describes a fault-tolerant disk storage and file system that falls in between the hardware and software categories. The system uses Reflective Memory to interconnect an array of standard computers comprising a massively parallel system. This archite...
An Approach for Fault-Tolerance in Hard Real-Time Distributed Systems
The presence of hard timing constraints makes the design of fault-tolerant systems difficult, because when tasks are replicated to treat errors, both the task replicas and the fault-tolerance building blocks (e.g. consensus) must be taken into account in the feasibility tests. This paper is devoted to the description of an approach for managing failures in hard real-time distributed systems. O...